
    Is engagement with a purpose the essence of active learning?

    In the 2009 edition of the conference on “Active Learning in Engineering Education”, a small workgroup held several fruitful discussions about the essence of active learning. In the end, we attempted to sum up the whole discussion in a single question, which is also the title of this essay. Taking this question as a starting point, this article proposes a specific purpose on which active learning can be based.

    Low-power high-efficiency video decoding using general purpose processors

    In this article, we investigate how code optimization techniques and the low-power states of general-purpose processors improve the power efficiency of HEVC decoding. The power and performance efficiency of SIMD instructions, multicore architectures, and low-power active and idle states are analyzed in detail for offline video decoding. In addition, the power efficiency of techniques such as “race to idle” and “exploiting slack” with DVFS is evaluated for real-time video decoding. Results show that “exploiting slack” is more power efficient than “race to idle” on all evaluated platforms, which represent smartphone, tablet, laptop, and desktop computing systems.
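    The trade-off between the two strategies can be illustrated with a toy energy model: “race to idle” runs at maximum frequency and sleeps until the frame deadline, while “exploiting slack” stretches the work over the whole frame period at a lower voltage and frequency. All numbers and the voltage/frequency relation below are illustrative assumptions, not measurements from the paper.

    ```python
    # Toy per-frame energy model for real-time decoding at 30 fps.
    # All constants are assumed for illustration only.

    DEADLINE = 1 / 30   # one frame period [s]
    WORK = 2e7          # cycles to decode one frame (assumed)
    F_MAX = 2e9         # maximum core frequency [Hz] (assumed)
    P_IDLE = 0.1        # power in the low-power idle state [W] (assumed)
    CAP = 1e-9          # lumped switching capacitance [F] (assumed)

    def voltage(f):
        """Assumed linear voltage/frequency relation for the DVFS model."""
        return 0.6 + 0.4 * f / F_MAX

    def active_power(f):
        """Dynamic power P = C * V(f)^2 * f."""
        return CAP * voltage(f) ** 2 * f

    def race_to_idle():
        """Decode at F_MAX, then idle until the frame deadline."""
        busy = WORK / F_MAX
        return active_power(F_MAX) * busy + P_IDLE * (DEADLINE - busy)

    def exploit_slack():
        """Run at the lowest frequency that still meets the deadline."""
        f = WORK / DEADLINE
        return active_power(f) * DEADLINE
    ```

    With these assumed numbers, exploiting slack wins because dynamic power scales roughly with V² · f, so lowering both voltage and frequency saves more energy than sprinting and idling, in line with the abstract's conclusion.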

    On latency in GPU throughput microarchitectures

    Modern GPUs provide massive processing power (arithmetic throughput) as well as memory throughput. While it appears to be well understood how performance can be improved by increasing throughput, it is less clear what effect microarchitectural latencies have on the performance of throughput-oriented GPU architectures. In fact, little is publicly known about the values, behavior, and performance impact of microarchitectural latency components in modern GPUs. This work attempts to fill that gap by analyzing both the idle (static) and loaded (dynamic) latency behavior of GPU microarchitectural components. Our results show that GPUs are not as effective at latency hiding as commonly thought, and on that basis we argue that latency should be a GPU design consideration alongside throughput.

    Spatio-temporal SIMT and scalarization for improving GPU efficiency

    Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it has not been evaluated. We present a complete design and evaluation of TSIMT GPUs, together with scalarization and a combination of temporal and spatial SIMT named Spatiotemporal SIMT (STSIMT). Simulations show that TSIMT alone reduces performance, but combining scalarization with STSIMT yields a mean performance improvement of 19.6% and improves the energy-delay product by 26.2% compared to SIMT.
    (EC/FP7/288653/EU/Low-Power Parallel Computing on GPUs/LPGP)

    Application-Specific Cache and Prefetching for HEVC CABAC Decoding

    Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module in the HEVC/H.265 video coding standard. As in its predecessor, H.264/AVC, CABAC is a well-known throughput bottleneck due to its strong data dependencies. Besides other optimizations, the replacement of the context model memory by a smaller cache has been proposed for hardware decoders, resulting in an improved clock frequency. However, the effect of potential cache misses has not been properly evaluated. This work fills that gap by performing an extensive evaluation of different cache configurations. Furthermore, it demonstrates that application-specific context model prefetching can effectively reduce the miss rate and increase overall performance. The best results are achieved with two cache lines consisting of four or eight context models. The 2 × 8 cache allows a performance improvement of 13.2 percent to 16.7 percent compared to a non-cached decoder, owing to a 17 percent higher clock frequency and highly effective prefetching. The proposed HEVC/H.265 CABAC decoder allows the decoding of high-quality Full HD videos in real time using few hardware resources on a low-power FPGA.
    (EC/H2020/645500/EU/Improving European VoD Creative Industry with High Efficiency Video Delivery/Film26)
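    The cache-plus-prefetch idea can be sketched with a toy simulator. The 2-line × 8-model configuration comes from the abstract; the LRU replacement policy and the access sequence below are illustrative assumptions, not the paper's hardware design.

    ```python
    from collections import OrderedDict

    LINE_SIZE = 8   # context models per cache line (the 2 x 8 configuration)
    NUM_LINES = 2

    class ContextModelCache:
        """Toy LRU cache over context-model lines; a sketch, not the paper's RTL."""

        def __init__(self):
            self.lines = OrderedDict()   # line index -> present, in LRU order
            self.misses = 0
            self.accesses = 0

        def _touch(self, line):
            """Bring a line to most-recently-used; return True on a hit."""
            if line in self.lines:
                self.lines.move_to_end(line)
                return True
            if len(self.lines) >= NUM_LINES:
                self.lines.popitem(last=False)   # evict least-recently-used line
            self.lines[line] = True
            return False

        def access(self, ctx_index):
            """Read a context model; a miss is charged if its line is absent."""
            self.accesses += 1
            if not self._touch(ctx_index // LINE_SIZE):
                self.misses += 1

        def prefetch(self, ctx_index):
            """Application-specific prefetch: load the line of a context model
            the decoder knows it will need next, hiding the miss latency."""
            self._touch(ctx_index // LINE_SIZE)
    ```

    Prefetching a context model before the bin that needs it turns what would be a stalling miss into a hit, which is how the paper's decoder keeps the miss rate low despite the tiny cache.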

    A low-complexity parallel-friendly rate control algorithm for ultra-low delay high definition video coding

    Ultra-low-delay high-definition (HD) video coding applications such as video conferencing demand, first, low-complexity video encoders that support multi-core frameworks for parallel processing and, second, rate control algorithms (RCAs) for successful video content delivery under delay constraints. In this paper, a low-complexity, parallel-friendly RCA is proposed for HD video conferencing. Specifically, it has been implemented on an optimized H.264/Scalable Video Coding (SVC) encoder, providing excellent performance in terms of buffer control while achieving acceptable quality of the compressed video under the imposed delay constraints.

    Syntax Element Partitioning for high-throughput HEVC CABAC decoding

    Encoder and decoder implementations of the High Efficiency Video Coding (HEVC) standard have been subject to many optimization approaches since its release in 2013. However, real-time decoding of high-quality, ultra-high-resolution videos is still a very challenging task. In particular, entropy decoding (CABAC) is most often the throughput bottleneck at very high bitrates. Syntax Element Partitioning (SEP) was proposed for the H.264/AVC video compression standard to address this issue and the limitations of other parallelization techniques. Unfortunately, it was not adopted in the latest video coding standard, although it can multiply CABAC decoding throughput. We propose an improved SEP scheme for HEVC CABAC decoding with eight syntax element partitions. Experimental results show throughput improvements of up to 5.4× with negligible bitstream overhead, making SEP a useful technique for addressing the entropy decoding bottleneck in future video compression standards.

    An Optimized Parallel IDCT on Graphics Processing Units

    In this paper we present an implementation of the H.264/AVC Inverse Discrete Cosine Transform (IDCT) optimized for Graphics Processing Units (GPUs) using OpenCL. By exploiting the fact that most of the IDCT input data for real videos are zero-valued coefficients, a new compacted data representation is created that allows for several optimizations. Experimental evaluations conducted on different GPUs show average speedups from 1.7× to 7.4× compared to an optimized single-threaded SIMD CPU version.
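    The compaction idea can be sketched as follows: gather only the blocks that contain nonzero coefficients into a dense batch, transform that batch, and scatter the results back, letting all-zero blocks bypass the transform entirely. A NumPy orthonormal DCT stands in for the integer H.264 transform, and the batching here is an assumption about the scheme, not the paper's OpenCL kernel.

    ```python
    import numpy as np

    N = 4  # H.264 uses 4x4 transform blocks

    # Orthonormal DCT-II basis matrix; C.T @ block @ C performs a 2-D inverse
    # transform. This stands in for the integer H.264 transform.
    C = np.cos(np.pi * np.outer(np.arange(N), 2 * np.arange(N) + 1) / (2 * N))
    C *= np.sqrt(2 / N)
    C[0] /= np.sqrt(2)

    def idct_compacted(blocks):
        """Transform only nonzero blocks; all-zero blocks bypass the IDCT."""
        blocks = np.asarray(blocks, dtype=float)            # shape (n, 4, 4)
        mask = blocks.reshape(len(blocks), -1).any(axis=1)  # which blocks are nonzero
        out = np.zeros_like(blocks)
        dense = blocks[mask]                                # compacted representation
        # Batched 2-D inverse transform: C.T @ b @ C for each gathered block.
        out[mask] = np.einsum('ij,njk,kl->nil', C.T, dense, C)
        return out
    ```

    On real video, where most blocks are entirely zero, the gather step shrinks the work list dramatically, which is what makes the GPU kernel's memory accesses dense and its threads divergence-free.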

    Optimizing HEVC CABAC decoding with a context model cache and application-specific prefetching

    Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module in the most recent JCT-VC video coding standard, HEVC/H.265. As in its predecessor, H.264/AVC, CABAC is a well-known throughput bottleneck due to its strong data dependencies. Besides other optimizations, the replacement of the context model memory by a smaller cache has been proposed, resulting in an improved clock frequency. However, the effect of potential cache misses has not been properly evaluated. Our work fills this gap with an extensive evaluation of different cache configurations. Furthermore, we demonstrate that application-specific context model prefetching can effectively reduce the miss rate and make it negligible. The best overall performance was achieved with caches of two and four lines, where each cache line consists of four context models. Four cache lines allow a speed-up of 10% to 12% for all video configurations, while two cache lines improve throughput by 9% to 15% for high-bitrate videos and by 1% to 4% for low-bitrate videos.
    (EC/H2020/645500/EU/Improving European VoD Creative Industry with High Efficiency Video Delivery/Film26)

    Parallel H.264/AVC motion compensation for GPUs using OpenCL

    Motion compensation is one of the most compute-intensive parts of H.264/AVC video decoding. It exposes massive parallelism that can benefit from graphics processing units (GPUs). Control and memory divergence, however, may lead to performance penalties on GPUs. In this paper, we propose two GPU motion-compensation kernels, implemented in OpenCL, that mitigate the divergence effect. In addition, the motion-compensation kernels have been integrated into a complete and optimized H.264/AVC decoder that supports the H.264/AVC High profile. We evaluated our kernels on GPUs with different architectures from AMD, Intel, and Nvidia. Compared with the fastest CPU used in this paper, our kernel achieves a 2.0× speedup on a discrete Nvidia GPU at the kernel level. However, when the overheads of memory copy and the OpenCL runtime are included, no speedup is gained at the application level.
    (EC/FP7/288653/EU/Low-Power Parallel Computing on GPUs/LPGP)